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Abstract: Large-scale multiple testing problems require the simultaneous 
assessment of many p- values. This paper compares several methods to as- 
sess the evidence in multiple binomial counts of p-values: the maximum of 
the binomial counts after standardization (the 'higher-criticism statistic'), 
the maximum of the binomial counts after a log-likelihood ratio transfor- 
mation (the 'Berk- Jones statistic'), and a newly introduced average of the 
binomial counts after a likelihood ratio transformation. Simulations show 
that the higher criticism statistic has a superior performance to the Berk- 
Jones statistic in the case of very sparse alternatives (sparsity coefficient 
/3 § 0.75), while the situation is reversed for /3 S 0.75. The average likeli- 
hood ratio is found to combine the favorable performance of higher criti- 
cism in the very sparse case with that of the Berk-Jones statistic in the less 
sparse case and thus appears to dominate both statistics. Some asymptotic 
optimality theory is considered but found to set in too slowly to illuminate 
the above findings, at least for sample sizes up to one million. In contrast, 
asymptotic approximations to the critical values of the Berk-,Iones statistic 
that have been developed by Wellner and Koltchinskii (2003) and Jager 
and Wellner (2007) are found to give surprisingly accurate approximations 
even for quite small sample sizes. 
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1. Introduction 

This paper is concerned with the following mixture problem: One observes 
Xi, . . . , Xn i.i.d. F and one wants to test 

Hq : F = $, the standard normal distribution function 
versus 

Hi : F = {1 - €)'i> + £<!>(■ - /i) for some e € (0, 1), fi > 0. 

Interest in this prototypical setting derives from a number of applications that 
involve large-scale multiple testing, see e.g. Donoho and Jin (2004). In the case 
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where the proportion of nonzero means is small, e = e„ = n~^, for /3 € (i, 1), 
there is the following result: Parametrize ^ = ^n = \/2r log n for r € (0, 1) and 
define the detection boundary 



If r < p* then it is impossible to detect the presence of the nonzero means 
Any test with asymptotic level a € (0, 1) can only have trivial asymptotic 
power a. On the other hand, if r > p* (/?) , then the likelihood ratio test (which 
requires the knowledge of /3 and r) at asymptotic level a will have asymptotic 
power 1, see Ingster (1997,1998) and Jin (2004). But /3 and r are unknown, so 
direct application the likelihood ratio test is not possible. Jin (2004) and Donoho 
and Jin (2004) propose to employ the higher criticism statistic 

HC; = max V^(i/n-p(i))/w'p(j)(l-p(i)), 

where pi = P(A''(0, 1) > Xj) is the p-value of Xj, and they show that HC* also 
attains the optimal detection boundary, i.e. HC* has asymptotic power 1 for all 
j3 e (5,1) and r > p*{(i)- Note that HC* does not require the knowledge of j3 
and r. 



2. Combining the evidence of multiple binomial counts 

Denote by Fn the empirical distribution function of the p- values: F„(f) := 
n Sr=i -'-(Pi — Then one sees that 

HC:= max (1) 
*e{p(i),-..,PW2)} -y/t(l-t) 

Under the null hypothesis, the p- values pi are an i.i.d. sample from ?7[0,1]. 
Thus the quantity \/n -^"1*L ^ is the standardized count of p-vahies that fall in 

the interval (0,f], and so HC* looks for an excessive number of p- values in the 
intervals (0, t\ for t G (0, 5] by considering the maximum of these standardized 
binomial counts over the intervals (0,p(j)] for i = 1, . . . ,n/2. 

While a standardized binomial random variable is a classical example to 
illustrate the convergence to a normal distribution, it is important to keep in 
mind that its long tail is not any more subgaussian: As the success probability 
moves from ^ to 0, the long tail becomes increasingly heavy, see Shorack and 
Wellner (1986, Ch. 11.1). In fact, the first several terms in HC* even have heavy 
algebraic tails, as can be seen from an argument similar to Sec. 3 in Donoho and 
Jin (2004). Since the distribution of the max depends sensitively on the tails, 
this means that standardizing the counts does not guarantee that all counts 
are treated equally. Rather, HC* gives increasingly more weight to counts with 
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smaller index i. This raises the question what effect this has on the performance 

of Hc;. 

To investigate this issue, we can compare the performance of HC* with a 
statistic that standardizes the binomial counts differently to avoid unequal and 
heavy tails. Such a standardization is given by the log-likelihood ratio transfor- 
mation. Define 



logLRn{t) 



'nF^{t) log ^^+n{l- F^{t)) log if < i < 

otherwise. 



logLRn{t) is the one-sided log- likelihood ratio statistic for testing whether the 
parameter of the binomial count nFn{t) equals t vs. whether it is larger than t. 
The log-likelihood ratio transformation possesses the important property that it 
produces clean subexponcntial tails under the null hypothesis, no matter what 
the binomial parameter t. This fact is implicit in the proof of the Chernoff- 
Hoeffding theorem, see Hoeffding (1963). One can now proceed as with HC* 
and take the maximum of the thus standardized binomial counts over the ran- 
dom intervals (0,p(j)]. This essentially yields a statistic proposed by Berk and 
Jones (1979): 

BJ+ = max logLRn,i, 

l<i<n/2 

where logLRn,i := logLRnip^i)) = (ilog^ + {n - z) log ^3^)1 (p(i) < ^). 

BJ^ was shown by Donoho and Jin (2004) to also attain the optimal detection 

boundary. Both HC* and BJ^ are special cases of a family of goodness-of- 
fit tests based on (/(-divergences that are introduced and studied by Jager and 
Wellner (2007). 

We compare the power of HC* and BJ^ against alternatives /i„ = ^2r log n 
with r = r{/3) = 1.2p*(/3) -|- 0.1 for ten equally spaced values of /3 between 0.5 
and 1. The significance level was set to 5% by estimating the exact finite sample 
critical values of HC* and BJ^ with 10^ simulations. The power of HC* and 
BJ^ was then simulated with 10^ simulations. The left plot in Figure 1 shows 
the resulting power values for sample size n = 10^, the right plot for sample 
size n = 10^. One sees that HC* has a better detection performance in the very 
sparse case /3 ^ | , while B J^ has a better performance for smaller (3. 

The preceding discussion suggests the following explanation of this result: 
Donoho and Jin (2004) observed that for (3 G [|,1) the strongest evidence 
against Hq is found near the maximum of the observations, i.e. at the smallest 
p- values. Since HC* gives more weight to smaller p- values compared to BJ^, 
HC* will have more power. But when (3 € (^, |), then the most informative 
place to look is at larger p-values, i.e. one needs to examine the count of p- 
values in the interval {0,t\ for certain t G (0, 1). Since HC* gives less weight to 
the evidence in those intervals, it suffers a performance penalty in this case. 

The simulation study also confirms the cautionary remarks in Donoho and 
Jin (2004) about the sample size required for the above asymptotic optimality 
theory to adequately assess the performance of statistical procedures. Both HC* 



imsart-generic ver. 2011/05/20 file: v2.tex date: November 3, 2011 



G. Walther/ Average Likelihood Ratio for Detecting Mixtures 



4 



0.5 0.55 O.G 0.55 0.7 0.75 0.3 0.85 0.9 0.9E 



0.5 0.55 0.5 



0.7 0.75 0.3 0.35 0.9 0.9E 



Fig 1: Power of HC; (dashed) and BJ+ (dash-dot) as a function of the sparsity 
parameter /3. The left plot shows power for sample size n = 10'*, the right plot 
for n = \{f. 



and BJ^ attain the optimal detection boundary, i.e. have asymptotic power 1 
against the alternatives considered in the above simulation study. But even for 
a sample size of one million, their detection power is quite small for a large 
range of /3 values. Moreover, the difference in power between these two optimal 
procedures is larger than the gain in power obtained by increasing the sample 
size lOOfold from n = 10'* to n = 10^. Thus it appears that the asymptotic 
optimality theory sets in too slowly to be informative for sample sizes up to at 
least a million, and it seems prudent to instead assess the performance of such 
procedures primarily via simulation studies. 

The difference in performance between HC* and BJ^ for various j3 raises the 
question whether this difference represents an unavoidable trade-off, or whether 
it is possible to improve on this overall performance. If a better performance is 
possible, how should one go about developing a better test? 

3. The average likelihood ratio statistic 

A promising approach to obtain good power uniformly in /? is a minimax test, 
which is typically constructed as a Bayes solution with respect to a least favor- 
able prior, see Lehmann and Romano (2005, Ch. 8.1). But in the context at hand, 
such a construction appears to be involved since it requires the specification a 
multivariate prior over an appropriate set of alternative distributions. 

Instead we proceed as follows: Suppose we start with an noninformative uni- 
form prior for the parameter /3 on (i, 1). Given /?, we can use knowledge about 
the problem to construct an appropriate conditional test: Donoho and Jin (2004) 
observe that for /3 € [|, 1) the most promising approach is essentially to look at 
the smallest p- value. Thus we put prior probability ^ on the likelihood ratio test 
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over the interval (0,p(i)]. For /3 e (i, |), the most promising interval to detect 
alternatives with r close to the detection boundary p*{j3) = (3 — ^ is the interval 
(0, n"^'']. Thus given such a 13, we will employ the likelihood ratio test on the 
interval (0, t] with t = n-^C^-s). If /3 ~ U{^, |), then t = n'^^'^-s) has density 
proportional to ^ on 1). Approximating the resulting posterior integral with 
the corresponding weighted sum of the and observing that the normalizing 

factor of the weights is J2i=2 \ ~ log(n/3) yields the average likelihood ratio 

1 1 1 

ALR„ = ^ii?n,i + ^E,iog(n/3)^^"'^ 



i=2 



where 



(i^) '^PH) < ^' 
1 otherwise. 



Thus LRn,i is the one-sided likelihood ratio statistic for testing whether the 
parameter of the binomial count on (0, t] equals t, evaluated at t = . 

Theorem. ALRn attains the optimal detection boundary. 

For a proof, note that it was shown in Donoho and Jin (2004) that with 
probability converging to 1 there exists an index i € {1, . . . ,n/2} such that 
logLRn,i ^ n'^, where k = k{^, r) > 0. Hence BJ^ (and HC* ) grow algebraically 
fast under the alternative. Now LRn,i = exp{logLRn i) ^ exp(n''). Thus ALR„ 
grows exponentially fast. Some informal arguments given below suggest that 
ALR„ may have a limiting distribution under Hq, but to complete the proof in 
a rigorous way it is enough to employ the upper bound ALR„ < 2cxp(BJ^) to- 
gether with BJ^ / log log n A 1 under Hq, see Jager and Wellner (2007,Thm.3.1). 
□ 

The exponential increase of ALR„ has to be taken with a grain of salt. De- 
pending on P and r, the constant «;(/?, r) may be close to zero. Then an enormous 
n is required for Li?„,i to overcome the divisor i log(n/3) if i > 2. Of course, the 
same calamity befalls BJ^ and HC*, where the polynomial needs to over- 
come a critical value of order log log n. This appears to be one of the reasons 
why the asymptotic theory is so slow to take hold. 

As discussed above, it is therefore prefcrrablc to evaluate the performance of 
ALR„ with a simulation study. Figure 2 compares the power of ALR„, HC* , 
and BJ^ in the same setting that was considered in section 2. 

One sees that ALR„ combines the good performance of HC* at larger [3 with 
the good performance of B at smaller (3 and thus results in a test that appears 
to dominate both HC* and BJ^. 

To avoid numerical difficulties when n is large, it is advisable to rewrite 
LRn,i = s^p{logLRn^i) with logLRn^i given in section 2. As above, the simu- 
lation study used a size of 5% for all three tests by estimating the exact finite 
sample critical values with 10"'' simulations. Since such a simulation may not 
be practical for larger samples, it is of interest to explore whether reasonably 
accurate asymptotic approximations are available. 
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Fig 2: Power of ALR„ (solid), HC; (dashed) and BJ+ (dash-dot) as a function 
of the sparsity parameter j3. The left plot shows power for sample size n = 10^, 
the right plot for n = 10^. 



4. Asymptotic approximations for the null distributions 

A first attempt to derive a simple large sample approximation for the critical val- 

ues of HC* and B J„ can be based on HC*/V21og \ogn — ^ 1 and B J„ / log log n — >■ 
1, which follows e.g. from Jager and Wellner (2007, Thm. 3.1). The significance 
levels obtained by using the resulting thresholds \/2 log logn and log log n for 
HC* and BJ^, respectively, are listed under 'thresh' in Table 1. One sees that 
the resulting size of the tests is very large even for n = 10^. 

A more refined approximation can be derived from results about the conver- 
gence to an extreme value distribution. In the case of HC* , this result follows 
from Jaeschke (1979) and Eicker (1979), see also Shorack and Wellner (1986, 
Ch.l6). In the case of BJ^ a proof was sketched in Berk and Jones (1979). 
Wellner and Koltchinskii (2003) note an apparent error in that sketch and give 
a rigorous proof. See also Jager and Wellner (2007, Thm. 3.1) for a unified treat- 
ment of HC* and BJ . The latter theorem establishes convergence of two-sided 
versions of BJ^ and ^(HC*)^, after centering, to an extreme value distribution 
with distribution function E^{x) = exp(— 4exp(— x)). As remarked in Shorack 
and Wellner (1986, p. 600), the two one sided versions as well as the two halves 
{i ^ n/i) are asymptotically independent. Therefore the pertinent limit for 
HC* and BJj[ considered here should be E^. The resulting approximation for 
the level a critical value for BJ^ is 

qa := loglogn+ ^logloglogn- ^ log(47r) - log(- log(l - a)), (2) 

and the corresponding approximation for HC* is \/2q^. It is known that conver- 
gence to an extreme value distribution is typically extremely slow, see Hall (1979). 
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Calibration 


thresh 


EVI 


EVII 


Statistic 


HC; 


BJ+ 


Hc: 


BJ+ 


Hc; 


BJ+ 


Nominal level in % 






5 


10 


5 


10 


5 


10 


5 


10 


n = 10^ 


44.7 


34.7 


20.8 


27.2 


7.2 


13.4 


19.6 


25.3 


6.2 


11.4 


10^ 


45.0 


34.0 


20.0 


26.2 


6.7 


12.3 


19.1 


25.1 


6.1 


11.2 


10" 


45.7 


34.4 


19.2 


25.2 


6.4 


11.7 


18.6 


24.3 


5.9 


10.9 


10^ 


45.6 


34.4 


18.4 


24.4 


6.2 


11.3 


17.9 


23.7 


5.9 


10.7 


10« 


46.0 


34.9 


18.0 


23.9 


6.2 


11.4 


17.6 


23.3 


5.9 


10.8 



Table 1 

Finite sample significance levels (in %) of HC^ and BJ^ for various asymptotic 
approximations to critical values. Based on 10^ simulations. 



Thus there would seem to be little hope that the above approximation is use- 
ful for moderate sample sizes, in particular since it involves a doubly- iterated 
(!) logarithm. But surprisingly, the simulation study in Table 1 shows that the 
above approximation (labelled 'EVI') is quite good for BJ^ even for sample sizes 
as small as n = 100. This appears to be another benefit of the clean exponential 
tails resulting from the log-likelihood ratio transformation. Unfortunately, the 
approximation does not work well for HC* , where it yields very anti-conservative 
results. 

Wellner and Koltchinskii (2003) suggest a further improvement for the ap- 
proximation to BJ^ by using the centering c^/(26^) with c„ = 2 log log n -|- 
\ log log log n — I log(47r) and b"^ = 2 log log n in place of the first three terms on 
the right hand side of (2) . The results of this approximation are labelled 'EVIF 
in Table 1 and show a further improvement for BJ^, but still not a useful out- 
come for HC*. This is presumably due to the heavy binomial tails which are 
not taken care of by the standardization in HC* . 

In connection to this it is worth pointing out that a key argument in proving 
the above limit theorems is to show that with high probability the first log^ n 
terms in HCJj and BJ^ do not contribute to the maximum, and that for the 
remaining terms a strong approximation with a Brownian bridge is applicable. In 
particular, this means that asymptotically the heavy binomial tails don't matter, 
and that the maximum will not be attained at the first few terms. But as shown 
by the simulations above and elsewhere, such as in Donoho and Jin (2004), this 
is certainly not the case for sample sizes of up to at least n = 10^, which is 
the largest sample size we could explore in a reasonable amount of time. As 
remarked in Wellner (2006, p.43) concerning the applicability of the asymptotic 
results, one needs n > 1010388 « 10^ just to get log^n < n/2. 

Next we consider ALR„ and write 



logLRn,! = 



log 



1 , ^, 1-1/n 
-h (n - 1) log ' 



1 - 



l(p(i) < 1/n) 



1 



n 



+ log(l - p(i)) + (n - 1) log(l - 1/n) l(np(i) < 1) 



Recall that imder iJg we can use the representation = Ei/{Ei + . . . +i?„_|_i), 
where {Ei] is an infinite sequence of i.i.d. Exp(l) random variables, see Shorack 
and Wellner (1986, p. 335). Thus logLRn,\ has the same distribution as a random 
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variable that converges a.s. to (—log Si + Ei — 1)1 (Si < 1) by the strong law. 
Hence 

i^„,4(=fli)'-''-"'. (3) 

Next, set T„ :— {i : < log^n/n}. Using (A. 4) in Donoho and Jin (2004) 
and (26) on p. 602 of Shorack and Wellner (1986), we get 

(n-Pii)) 

max logLRnA < max- = Op(loglogn). 

iexn «ex„ 2p(i)(l-p(j)) 

Hence on the event An ■= {#In < 21og^ n}: 

— ^ exp((op(loglogn)) ^^^^^1^^ = Op(l), 
.^«log(n/3) V V log(n/3) 

and W{Al) = P(bin(n,log^ n/n) > 21og''^ n) ^ by Chebychcv. 

For > log^ n/n one can proceed as in the proof of Thm. 3.1 in Jager and 
Wellner (2007), see also the proof of Thm. 1.1 in Wellner and Koltchinskii (2003), 
and as on p. 601 of Shorack and Wellner (1986) and first approximate the log- 
likelihood ratio process by the square of the normalized empirical process and 
then by the square of a normalized Brownian Bridge. This suggests that 



i=2 



. ,,,^M„,l w L„ := / Texp ' ' ]dt. 

i\og{n/S) logn7i/„ t \2t{l-t)J 



It is not clear whether L„ has a finite limit distribution. Simulations show that 
the quantiles of L„ increase very slowly as n increases from 10^ to 10^. Formally 

applying rHopital's rule gives lim„^.oo i„ = lim„^oo exp ^n(i-i/n) ) • 

^^'p(^ -^n(i-i]n) ) ^ 6^P(i-^^^) with Z ~N(0,1), a conjecture for the limit law 
of ALR„ would be 



«p(K,)^.<^.<.)^l^^^,l^,., (4) 

This expression reflects the fact that the beta distribution of the first order 
statistic behaves like an exponential distribution, while sufficiently larger or- 
der statistics possess a beta distribution that is closer to a normal. Of course, 

I'Hopital's rule is not applicable since lim„^co cxp 2^^(1-1/") ) "io^^ i^ot exist 
by the law of the iterated logarithm for the Brownian bridge, so even if the law 
of Ln converges, the limit does not have to be the law of cxp(|Z+ ). 

Table 2 gives the finite sample significance levels of ALR„ resulting from the 
approximation (4) in the column 'Calibration 1'. The critical values used for 
calibration 1 are 6.05 and 3.42, which were obtained from 10^ simulations of (4). 
Calibration 2 uses L„ with n = 10^ in place of exp(iZ+^). The resulting critical 
values are 6.16 and 3.60. One sees that both approximations are reasonably 
accurate, albeit somewhat anti-conservative, for the sample sizes considered. 



imsart-generic ver. 2011/05/20 file: v2.tex date: November 3, 2011 



G. Walther/ Average Likelihood Ratio for Detecting Mixtures 9 





Calibration 1 


Calibration 2 


Nominal level in % 


5 


10 


5 


10 


n = 10^ 


6.3 


12.5 


6.2 


11.7 


103 


6.0 


12.0 


5.9 


11.3 


10" 


5.8 


11.9 


5.7 


11.1 


10^ 


5.7 


11.7 


5.6 


11.0 


10® 


5.7 


11.8 


5.4 


11.0 



Table 2 



Finite sample significance levels (in %) of ALR„ for two different approximations to the 
critical values of ALR„. Based on 10® simulations. 

5. Relation to other work and open problems 

Different variations of the average likehhood ratio have been used successfully 
in other detection problems, sec e.g. Shiryaev (1963), Burnashcv and Bcgma- 
tov (1990), Diimbgen (1998), Siegmund (2001), Gangnon and Clayton (2001), 
Chan (2009) or Chan and Walther (2011), but the above weighted average like- 
lihood ratio seems not to have been considered before. 

It is worthwhile to compare the above results with the setting where the 
proportion e„ of observations with nonzero means is not scattered randomly but 
possesses structure, e.g. when e„n consecutive observations possess an elevated 
mean. Such problems are typically addressed with the scan statistic, i.e. the 
maximum likelihood ratio statistic. It was shown by Arias-Castro et al. (2005) 
that the scan can detect elevated means of size = ■\/2 logn/(e„n). Chan 
and Walther (2011) showed that the scan cannot do better than that but that 
a version of the average likelihood ratio can detect smaller means where the 
factor a/2 logn in the numerator is replaced by ■\/21og(l/e„) ~ a/2/3 logn. No 
test can improve on this latter rate. Thus the scan is optimal only in the case of 
a single elevated mean, but its performance relative to the ALR deteriorates as 
the proportion of nonzero means increases. It was also shown in Walther (2010) 
and Chan and Walther (2011) that optimality of the scan can be restored by 
employing scale-dependent critical values. Comparing with the results in the 
present paper, one sees that structure in the elevated means allows to greatly 
improve the detection power: In the case of consecutively elevated means, the 
detection boundary is lowered by a factor ~ ^e„n = VriP~^, which can be 
considerable. 

Regarding the setting in the present paper, it would be of interest to develop 
an optimality theory that allows to compare the performance of tests at more 
moderate sample sizes. Such a comparison might by possible by exploring the 
rate at which an estimator can approach the detection boundary while still 
guaranteeing consistency. See Walther (2010) and Chan and Walther (2011) for 
such an analysis in the case of consecutively elevated means. Finally, it would be 
of interest to perform a more formal investigation of a possible limit distribution 
of the average likelihood ratio. 
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